"# Develop, Train, Register and Batch Transform Scikit-Learn Random Forest\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n",
"\n",
"\n",
"In this notebook we show how to use Amazon SageMaker to train a Scikit-learn Random Forest model, register it in Model Registry, and run a Batch Transform Job. More info on Scikit-Learn can be found here https://scikit-learn.org/stable/index.html. We use the California Housing dataset, present in Scikit-Learn: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html. The California Housing dataset was originally published in:\n",
"\n",
"> Pace, R. Kelley, and Ronald Barry. \"Sparse spatial auto regressions.\" Statistics & Probability Letters 33.3 (1997): 291-297.\n",
"\n",
"Link to the paper: https://doi.org/10.1016/S0167-7152(96)00140-X"
"bucket = sess.default_bucket() # this could also be a hard-coded bucket name\n",
"\n",
"print(\"Using bucket \" + bucket)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Prepare data\n",
"\n",
"We use the California housing dataset.\n",
"\n",
"More info on the dataset:\n",
"\n",
"This dataset was obtained from the `StatLib` repository. http://lib.stat.cmu.edu/datasets/\n",
"\n",
"The target variable is the median house value for California districts.\n",
"\n",
"This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people)."
"The below script contains both training and inference functionality and can run both in SageMaker Training hardware or locally (desktop, SageMaker notebook, on premise, etc.). Detailed guidance here https://sagemaker.readthedocs.io/en/stable/using_sklearn.html#preparing-the-scikit-learn-training-script"
"## Launching a SageMaker training job with the Python SDK\n",
"\n",
"We will train two models: the first with 100 epochs, and the second with 300 epochs. The number of epochs has no specific meaning. We are interested in training two models, so we will be able to register each one of them into SageMaker Model Registry."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Launch the 1st training job\n",
"\n",
"Once we've defined our estimator, we can specify the hyperparameters we'd like to tune and their possible values.\n",
"This time we will train with 100 epochs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# We use the Estimator from the SageMaker Python SDK\n",
"## Register the model of the 1st training job in the Model Registry\n",
"Once the model is registered, you will see it in the Model Registry tab of the SageMaker Studio UI. The model is registered with the `approval_status` set to \"Approved\". By default, the model is registered with the `approval_status` set to `PendingManualApproval`. Users can then navigate to the Model Registry to manually approve the model based on any criteria set for model evaluation or this can be done via API."
"Call `describe_model_package` to see the details of the model version. You pass in the ARN of a model version that you got in the output of the call to `list_model_packages`."
"Let's inspect the output locations of both Batch Transform jobs in S3. You can see they have different locations due to their separate Batch Transform jobs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"sklearn_1_transformer.output_path"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"sklearn_2_transformer.output_path"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Conclusion\n",
"\n",
"In this notebook you successfully downloaded the California housing dataset and trained a model using SageMaker Python SDK.\n",
"Then you created a `ModelPackageGroup`, registered the Model Version in SageMaker Model Registry, and triggered a SageMaker Batch Transform Job to process the evaluation dataset from S3.\n",
"\n",
"You trained another model, this time with 300 epochs, registered this Model Version in SageMaker Model Registry, viewed the model versions, and again, triggered a SageMaker Batch Transform Job to process the evaluation dataset from S3. \n",
"\n",
"As next steps, you can try registering your own model in SageMaker Model Registry, and run a SageMaker Batch Transform Job on data you have on S3."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Notebook CI Test Results\n",
"\n",
"This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"